Data cleaning example project
Fri 06 March 2020
The data used in this example project is the titanic dataset which is a very common and widely used dataset and therefore also quite uninteresting as subject to an extensive data analysis or machine learning. But the version of the dataset that I have required suits the purpose of illustrating my data cleaning capabilities very well because it consists of different parts that have to be put together and contains problematic values of different kinds which have to be handled.
Imports and display setup:
import pandas as pd
Connecting the different parts of the dataset¶
Connecting part 1 and part 2:
part_1_df = pd.read_excel("./data/datasæt del1.xlsx")
part_2_df = pd.read_csv("./data/datasæt del2 komma.csv")
titanic_df = pd.concat([part_1_df, part_2_df])
titanic_df.reset_index(drop=True, inplace=True)
titanic_df.tail()
Adding names to the dataframe:
names_txt = pd.read_csv('./data/navne.txt', sep=",", header=None)
names_column = names_txt.loc[1:,1].reset_index(drop=True)
titanic_df['name'] = names_column
titanic_df.head()
Setting case_id as the index of the dataframe:
titanic_df.set_index('case_id', inplace=True)
Data cleaning¶
Removing duplicates in the dataset:
print('Locating duplicates:')
display(titanic_df[titanic_df.duplicated()])
print('\n Duplicates before:')
display(titanic_df.loc[550:551])
titanic_df = titanic_df.drop_duplicates()
# Showing the duplicate
print('\n Duplicates after:')
display(titanic_df.loc[550:551])
Removing row which contains too high fare value:
# Change the type in the fare column to float
titanic_df['fare'].replace('.', 0.0, inplace=True)
numeric_fare = pd.to_numeric(titanic_df['fare'])
titanic_df.loc[:, 'fare'] = numeric_fare
# Finding the row with the maximum fare value and removing it from the dataframe
print('Max fare value before:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])
titanic_df.drop(60, axis=0, inplace=True)
print('\n Max fare value after:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])
Removing datasource where the death is hard to determine:
# Locate values which are different than 0 and 1 in the survived column
print('Values different than 0 or 1 in survived column before:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])
titanic_df.drop(325, axis=0, inplace=True)
print('\n Values different than 0 or 1 in survived column after:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])
Removing data scource where age is below 0:
print('Value in age column below 0 before:')
display(titanic_df.loc[titanic_df['age'] < 0])
titanic_df.drop(172, axis=0, inplace=True)
print('\n Value in age column below 0 after:')
display(titanic_df.loc[titanic_df['age'] < 0])
Further cleaning of data¶
Overview of the cleaned dataframe as it is now:
print('Summary info about the dataframe:\n')
display(titanic_df.info())
print('\n Data types:')
display(titanic_df.dtypes)
Turning sibsp (sibling/spouse) and parch (parent/child) columns into one column called family and removing the individual columns:
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1
titanic_df = titanic_df.drop('sibsp', axis=1)
titanic_df = titanic_df.drop('parch', axis=1)
display(titanic_df.head())
Locating NaN values in the dataset:
titanic_df.isnull().sum()
Removing the cabin, boat, and body columns because it contains way too many NaN values:
titanic_df.drop(columns='cabin', inplace=True)
titanic_df.drop(columns='boat', inplace=True)
titanic_df.drop(columns='body', inplace=True)
Replacing NaN values in age col with median values:
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
Replacing NaN values in embarked col with embarked value with highest frequency:
display(titanic_df['embarked'].mode())
titanic_df['embarked'] = titanic_df['embarked'].fillna('S')
Creating adult col to indicate if the person is an adult or a child:
titanic_df['adult'] = 0
titanic_df['adult'][titanic_df['age'] >= 18] = 1
Renaming pclass column to something that is more interpretable:
titanic_df.rename(columns={'pclass': 'price_class'}, inplace=True)
The cleaned dataset now looks like this:
titanic_df.head()